{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import relevant packages\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import KFold\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from scipy.stats import norm\n",
    "from sklearn.linear_model import LassoCV, LinearRegression\n",
    "import patsy\n",
    "import warnings\n",
    "import statsmodels.api as sm\n",
    "warnings.simplefilter('ignore')\n",
    "np.random.seed(1234)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.011184,
     "end_time": "2021-02-28T17:17:25.882205",
     "exception": false,
     "start_time": "2021-02-28T17:17:25.871021",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Application: Heterogeneous Effect of Sex on Wage Using Double Lasso\n",
    "\n",
    "We use US census data from the year 2015 to analyse the effect of sex and interaction effects of other variables with sex on wage jointly. The dependent variable is the logarithm of the wage, the target variable is *female* (in combination with other variables). All other variables denote some other socio-economic characteristics, e.g. marital status, education, and experience.  For a detailed description of the variables we refer to the help page.\n",
    "\n",
    "\n",
    "\n",
    "This analysis allows a closer look how discrimination according to sex is related to other socio-economic variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv\"\n",
    "data = pd.read_csv(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define outcome and regressors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = np.log(data['wage']).values\n",
    "Z = data.drop(['wage', 'lwage'], axis=1)\n",
    "Z.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Construct all our control variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ultra flexible controls of all pair-wise interactions (around 1k variables); un-comment to run this\n",
    "Zcontrols = patsy.dmatrix('0 + (shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we+exp1+exp2+exp3+exp4)**2',\n",
    "                           Z, return_type='dataframe')\n",
    "\n",
    "Zcontrols = Zcontrols - Zcontrols.mean(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Construct all the variables that we will use to model heterogeneity of effect in a linear manner"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Zhet = patsy.dmatrix('0 + (shs+hsg+scl+clg+mw+so+we+exp1+exp2+exp3+exp4)',\n",
    "                      Z, return_type='dataframe')\n",
    "Zhet = Zhet - Zhet.mean(axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Construct all interaction variables between sex and heterogeneity variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Zhet['sex'] = Z['sex']\n",
    "Zinteractions = patsy.dmatrix('0 + sex + sex * (shs+hsg+scl+clg+mw+so+we+exp1+exp2+exp3+exp4)',\n",
    "                               Zhet, return_type='dataframe')\n",
    "interaction_cols = [c for c in Zinteractions.columns if c.startswith('sex')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Put all the variables together"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = pd.concat([Zinteractions, Zcontrols], axis=1)\n",
    "X.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Double Lasso for All Interactive Effects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Choose our penalized lasso regression model either cross validated lasso (see end of notebook for lasso based on the theoretically driven penalty)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "lasso_model = lambda: make_pipeline(StandardScaler(), LassoCV(cv=cv))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each target predictive effect estimate it via the partialling out process and calculate the quantities needed for the covariance calculation, which is the residual outcome, the residual target variable and the final stage residual epsilon."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alpha = {}\n",
    "res_y, res_D, epsilon = {}, {}, {}\n",
    "for c in interaction_cols:\n",
    "    print(f\"Double Lasso for target variable {c}\")\n",
    "    D = X[c].values\n",
    "    W = X.drop([c], axis=1)\n",
    "    res_y[c] = y - lasso_model().fit(W, y).predict(W)\n",
    "    res_D[c] = D - lasso_model().fit(W, D).predict(W)\n",
    "    final = LinearRegression(fit_intercept=False).fit(res_D[c].reshape(-1, 1), res_y[c])\n",
    "    epsilon[c] = res_y[c] - final.predict(res_D[c].reshape(-1, 1))\n",
    "    alpha[c] = [final.coef_[0]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Co-Variance Calculation and Confidence Intervals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate the covariance matrix of the estimated parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "V = np.zeros((len(interaction_cols), len(interaction_cols)))\n",
    "for it, c in enumerate(interaction_cols):\n",
    "    Jc = np.mean(res_D[c]**2)\n",
    "    for itp, cp in enumerate(interaction_cols):\n",
    "        Jcp = np.mean(res_D[cp]**2)\n",
    "        Sigma = np.mean(res_D[c] * epsilon[c] * epsilon[cp] * res_D[cp])\n",
    "        V[it, itp] = Sigma / (Jc * Jcp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate standard errors for each parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = X.shape[0]\n",
    "for it, c in enumerate(interaction_cols):\n",
    "    alpha[c] += [np.sqrt(V[it, it] / n)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Put them all in a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame.from_dict(alpha, orient='index', columns=['point', 'stderr'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate and pointwise p-value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = pd.DataFrame()\n",
    "summary['Estimate'] = df['point']\n",
    "summary['Std. Error'] = df['stderr']\n",
    "summary['p-value'] = norm.sf(np.abs(df['point'] / df['stderr']), loc=0, scale=1) * 2\n",
    "summary['ci_lower'] = df['point'] - 1.96 * df['stderr']\n",
    "summary['ci_upper'] = df['point'] + 1.96 * df['stderr']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Joint Confidence Intervals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Drootinv = np.diagflat(1/np.sqrt(np.diag(V)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaledCov = Drootinv @ V @ Drootinv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(123)\n",
    "U = np.random.multivariate_normal(np.zeros(scaledCov.shape[0]), scaledCov, size=10000)\n",
    "z = np.max(np.abs(U), axis=1)\n",
    "c = np.percentile(z, 95)\n",
    "c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = pd.DataFrame()\n",
    "summary['Estimate'] = df['point']\n",
    "summary['CI lower'] = df['point'] - c * df['stderr']\n",
    "summary['CI upper'] = df['point'] + c * df['stderr']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also set the penalized lasso model to be estimated based on the theoretically motivated penalty level using the hdmpy package. To install it run\n",
    "`pip install multiprocess`\n",
    "`git clone https://github.com/maxhuppertz/hdmpy.git`\n",
    "\n",
    "You can run the cells below and then repeat the whole analysis above using the newly defined `lasso_model` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(1, \"./hdmpy\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We wrap the package so that it has the familiar sklearn API\n",
    "import hdmpy\n",
    "from sklearn.base import BaseEstimator, clone\n",
    "\n",
    "class RLasso(BaseEstimator):\n",
    "    \n",
    "    def __init__(self, *, post=True):\n",
    "        self.post = post\n",
    "    \n",
    "    def fit(self, X, y):\n",
    "        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])\n",
    "\n",
    "lasso_model = lambda: RLasso(post=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 89.365707,
   "end_time": "2021-02-28T17:18:51.003711",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-02-28T17:17:21.638004",
   "version": "2.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
